July 29, 2006 10:30:18pm
Re: software specs
Publication of the Petabox software The Archive uses is forthcoming, pending a more complete and formal packaging of the software. But in the meantime we can talk about it informally. The Petabox hardware can of course run any of a number of software environments, but I will talk about The Archive's environment here.
The operating system is linux. The distribution used to be debian-stable, which runs the most stable versions of all packages, but the developers decided they really needed more recent versions of some packages so we are now running on ubuntu (which can be viewed as debian-unstable plus several more hours of debugging and QA). The kernel version is 126.96.36.199.
Monitoring is currently done via Nagios, which runs sixteen tests per storage node: backup/primary status, apache error log, disk error, disk health, disk sector, disk SMART, disk space, disk webgroup, http, udp-locator, system load, temperatures (disks and cpu), ping connectivity, http connectivity, ssh connectivity, and rsync daemon running.
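For anyone who hasn't written a Nagios test before, here is a minimal sketch of what a disk space check could look like. It is not The Archive's actual plugin. The function names and the 10% / 5% thresholds are assumptions made up for illustration; the only fixed part is the standard plugin convention of reporting 0 for OK, 1 for WARNING, 2 for CRITICAL, and 3 for UNKNOWN.

```python
import shutil

# Standard Nagios plugin status codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify_free(free_fraction, warn=0.10, crit=0.05):
    """Map a free-space fraction (0.0 to 1.0) to a Nagios status code.

    The default thresholds are illustrative assumptions.
    """
    if free_fraction < crit:
        return CRITICAL
    if free_fraction < warn:
        return WARNING
    return OK


def check_disk_space(path, warn=0.10, crit=0.05):
    """Return (status_code, message) for the filesystem holding `path`."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        return UNKNOWN, f"DISK UNKNOWN - {path}: {exc}"
    if usage.total == 0:
        # Some pseudo-filesystems report zero size; avoid dividing by zero.
        return UNKNOWN, f"DISK UNKNOWN - {path}: reported size is zero"
    free_fraction = usage.free / usage.total
    status = classify_free(free_fraction, warn, crit)
    label = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}[status]
    return status, f"DISK {label} - {path} {free_fraction:.1%} free"
```

A real plugin would print the message and exit with the status code, and nrpe would hand both back to the Nagios server.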
We've tried adding more monitoring, including Ganglia and Munin, but thus far haven't made it official, mostly due to lack of manpower and carry-through. We're running on a really skeletal administration team, and everyone's kept busy and stretched thin.
There is also a system status reporting function in the Catalog, which is the application we use to schedule and run most of the data clusters' Archive-specific functionality. Every job scheduled through the catalog shows up in an html table, along with the name of the datanode it's running on and its current status. Clicking on the datanode's name executes a script on the datanode via http, which runs ps, top, sar, df, and vmstat and provides a quick glance at the node's current system state.
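The quick-glance script can be approximated in a few lines. This is a rough sketch and not the Catalog's code: the function name, the timeout, and the command-line flags are all assumptions, with the flags picked so that each tool prints a single snapshot and exits rather than running interactively.

```python
import subprocess

# The five tools named above. The flags are assumptions, chosen so each
# command prints one snapshot and then exits.
STATUS_COMMANDS = [
    ["ps", "aux"],
    ["top", "-b", "-n", "1"],
    ["sar"],
    ["df", "-h"],
    ["vmstat"],
]


def collect_status(commands=STATUS_COMMANDS, timeout=10):
    """Run each command and return {command line: output or error note}."""
    report = {}
    for argv in commands:
        name = " ".join(argv)
        try:
            result = subprocess.run(
                argv, capture_output=True, text=True, timeout=timeout
            )
            # Fall back to stderr so a failing tool still says why.
            report[name] = result.stdout or result.stderr
        except FileNotFoundError:
            report[name] = "(command not installed)"
        except subprocess.TimeoutExpired:
            report[name] = "(timed out)"
    return report
```

Serving the result over http is then a matter of wrapping each entry in a pre-formatted block from whatever CGI or handler mechanism the web server provides.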
Administration is partly done by centralizing many system functions onto a single big server (the homeserver) from which the datanodes are slaved. The rest is done via administrators ssh'ing around. I'm not a big fan of this approach, preferring decentralized logic and automated correction of broken nodes upon detection, but it's not up to me. It helps to have a tool which the sysadmins use to launch many ssh commands in parallel (like, "mount -a" on a couple hundred hosts at a time, in a single command). If you install Ganglia for monitoring, you will get gexec which serves this function.
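If you don't have gexec handy, the parallel ssh idea is easy to sketch yourself. The following is a minimal illustration, not the tool our sysadmins use: the function names, the worker count, and the timeout are assumptions. It relies on the stock ssh client with key-based logins already set up, and BatchMode=yes keeps ssh from stopping to ask for a password.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def ssh_run(host, command, timeout=30):
    """Run `command` on `host` with the ssh client; return (exit code, output)."""
    try:
        result = subprocess.run(
            ["ssh", "-o", "BatchMode=yes", host, command],
            capture_output=True, text=True, timeout=timeout,
        )
        return result.returncode, result.stdout + result.stderr
    except subprocess.TimeoutExpired:
        return None, "(timed out)"


def run_on_hosts(hosts, command, runner=ssh_run, max_workers=50):
    """Run `command` on every host concurrently; return {host: (code, output)}.

    `runner` is pluggable so the fan-out logic can be tried without a network.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {host: pool.submit(runner, host, command) for host in hosts}
        return {host: future.result() for host, future in futures.items()}
```

Something like run_on_hosts(hosts, "mount -a") would then cover the couple-hundred-hosts case, with the failures easy to pick out by exit code.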
All nodes run pure-ftp, rsync, apache 1.3.33, smartd, openssh, a location server (proprietary), Darwin streaming server, and nrpe (for nagios).
There are four hard drives per 1U machine, and they are not RAIDed. Disk hda is divided into system-related filesystems: /, /var, /usr, and /tmp, with the balance of the disk's space put into data filesystem /0. The other three disks have one filesystem each, mounted on /1, /2, and /3. All filesystems are reiserfs (except /tmp, see below), but if I were making a new system I'd probably go with xfs instead. It performs better and provides superior administration tools (like xfsdump and xfsrestore), and the reiserfs maintainer seems to have gone insane.
Historically we have kept a nice big /tmp filesystem, and not cleaned it ever, allowing developers to use it to store database files, ad hoc logs, and other persistent data. Heretical as it may be, it served us very well. Recently we transitioned to making /tmp a ram-based tmpfs filesystem (which makes it hellishly fast, but volatile), and making /var/tmp the never-wiped scratch space. It may be a better way of doing things. Time will tell.
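Since the two arrangements behave so differently, it can be worth checking which one a given node actually has. Here is a small sketch that reads the filesystem type for a mount point out of the text of /proc/mounts. The function name is made up and the sample table is hypothetical (the device names are not taken from our nodes); the field order of device, mount point, type, and options is the standard /proc/mounts layout.

```python
def filesystem_type(mount_point, mounts_text):
    """Return the filesystem type mounted at `mount_point`, or None.

    `mounts_text` is the content of /proc/mounts, whose whitespace-separated
    fields are: device, mount point, type, options, dump, pass.
    """
    fs_type = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[1] == mount_point:
            fs_type = fields[2]  # a later mount shadows an earlier one
    return fs_type


# Hypothetical sample in /proc/mounts format, for illustration only.
SAMPLE_MOUNTS = """\
/dev/hda1 / reiserfs rw 0 0
tmpfs /tmp tmpfs rw 0 0
/dev/hda5 /var reiserfs rw 0 0
"""
```

On a live node you would pass in open("/proc/mounts").read() instead of the sample. A result of "tmpfs" for /tmp means anything stored there is gone after a reboot.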
I know many system parameters in /sys were tweaked (open file limit, and the like), but I don't know the details off the top of my head. It would be nice to collect that information into one place.